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A  INTRODUCTION  MATTHEW  J .  KEEPER 

Chief,  Technical  Inf ormation  Division 

Figure  1  indicates  how  the  single  topic  of  general  sparse  matrix  solution  using 
scalar  processors  may  be  broken  into  specialized  areas  of  study  when  implementation 
on  vector  architectures  is  considered. 

First,  highly  sparse  matrices,  usually  representing  ODE/algebraic-modeled  sys¬ 
tems,  are  easily  decoupled  by  re-ordering.  At  a  minimum,  locally-decoupled  equations 
may  be  solved  in  pipelined  scalar  mode  (see  below);  if  the  decoupled  subsystems  can 
be  arranged  (a)  to  have  identical  sparsity,  and  (b)  to  be  stored  a  constant  stride  apart, 
then  a  simultaneous  sparse  solver  [7]  may  be  invoked  and  a  vector  solution  obtained. 

As  sparse  systems  become  locally  coupled  -  as  occurs  in  finite  element  and  finite 
difference  problems  -  then  vectors  are  easily  defined  within  the  coupled  subsystems.  It 
is  worth  making  a  further  distinction  between 

(a)  intra-nodal  or  intra-element  coupling,  where  the  dimension  of  dense  sub¬ 
matrices  is  proportional  to  the  number  of  unknowns /node  or  unknowns /finite 
element,  and 

(b)  inter-nodal  or  inter-element,  where  the  coupling  between  grid  nodes  or 
finite  elements  determines  the  vector  length. 

Banded  and  profile  matrices  result  from  the  latter.  The  associated  vector  lengths 
are  the  products  of  the  number  of  unknown/node  (element)  and  the  number  of  coupled 
nodes.  These  lengths  are  therefore  always  longer  than  in  the  former  case,  so  that  com¬ 
mon  bandsolvers  offer  the  highest  performance  of  any  sparse  solvers. 

In  previous  research,  algorithms  and  CRAY-1  software  have  been  developed  on  this 
grant  for  (Figure  1) 

(a)  general  sparse  matrices  [10], 

(b)  patterned  sparse  matrices,  in  conjunction  with  a  vectorized  electronic 


circuit  analysis  program  [7][ll],  and 

(c)  blocked  matrices  arising  from  intra-nodal  coupling  [0]. 


In  addition,  Duff  and  Reid  [9]  have  converted  a  Fortran  frontal  ("intra-element")  solver 


to  the  CRAY-1. 


B.  PROGRESS 
1.  SPARSE  MATRICES 

Referring  to  Figure  1  again,  (a)-(c)  above  leave 

(d)  unpatterned  highly  sparse  matrices,  and 

(e)  banded  and  profile  matrices 

for  current  study. 
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UNPATTERNED 

A  highly-  and  randomly-sparse  matrix  necessitates  a  scalar  solution.  Even  so,  an 
ordering  (in  the  spirit  of  nested  dissection)  can  be  found  such  that  the  LU  factors  can 
be  written  in  the  form 
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where  Du  is  a  diagonal  matrix  and  Ly  and  Uji  extend  from  Du  to  the  matrix  boundary. 
The  ordered  steps  to  reduce  the  rth  pivot  block  are 


Drr 

<- 

Drr1 

(reciprocation) 

(1) 

Urr+X 

<- 

ArA-j+l 

(multiplication) 

(2) 

<- 

4r+lx+l4+l,r  ^r,r+l 

(multi,  /subtraction) 

(3) 

where  represents  the  unreduced  southeast  corner  of  the  matrix  at  the  rth 

reduction  step.  These  steps  can  be  performed  in  three  parallel  or  pipelined  steps. 

Because  all  of  the  above  operations  are  on  blocks  with  random  sparsity  and 
storage,  they  must  be  performed  in  scalar  mode.  We  have  elected  to  achieve  the 
highest  speed  by  generating  loopless  scalar  solution  code  in  the  manner  of  Gustavson 
[12]..  although  this  limits  the  matrix  size  to  perhaps  5,000  highly-sparse  equations. 
Rates  in  the  range  of  15  MFLOPS  on  the  CRAY-1  are  readily  achievable,  a  speedup  of  5:1 
to  10:1  over  other  implementations.  A  report  and  a  paper  are  currently  in  preparation 
on  these  equation  ordering  and  implementation  methods. 

BANDED/PROFILE 

It  has  become  popular  among  researchers  recently  studying  the  solution  of  2-D 
grids  by  general  sparse  solvers  on  the  CRAY-1  to  quibble  about  the  quality  of  software 
operating  in  the  range  of  5-20  MFLOPS.  Yet  it  is  well  known  that  banded  solution  of 
such  grids  is  possible  in  the  range  of  90  MFLOPS  [13].  It  was  therefore  deemed  desir¬ 
able  to  develop  a  high-performance  bandsolver  as  a  standard  for  comparison  with 
lower-performance  general  sparsity  software.  General  sparsity  software  would  then  be 
useful  only  if  the  reduced  operation  counts  associated  with  optimal  ordering  could 
compensate  for  the  higher  performance  of  a  banded  solver.  This  bandsolver  was  com¬ 
pleted  in  the  Fall  of  1961.  Careful  CAL  coding  using  a  CRAY-1  simulator  resulted  in  a 
1.3:1  to  2:1  speedups  over  the  best  previous  (Los  Alamos)  coding,  to  over  117  MFLOPS. 

It  is  well-known  that  factors  of  up  to  four  in  operation  count  may  be  achieved  in 
banded  solutions  of  common  irregular  grid  structures  by  following  the  bandwidth  pro¬ 
file.  Unfortunately,  the  pointers  to  describe  this  profile  constitute  a  serious  overhead 
in  the  inner  loops  of  the  vectorized  reduction  algorithm.  It  was  therefore  decided  to 
block  the  profile  parallel  to  the  diagonal  (Figure  2).  Diagonal  blocking  is  natural  to  2-D 


grid  solution,  and  so  introduces  few  extra  computations;  it  also  migrates  the  symbolic 
pointers  to  the  outer  loops,  since  block  descriptors  point  at  large  matrix  substruc¬ 
tures.  Overall,  this  appears  to  be  the  best  compromise  between  the  operation  count 
efficiency  of  general  sparsity  methods  and  the  vectorizability  of  banded  solution. 

Software  has  been  developed  for  the  CRAY-1  to  solved  banded  and  profile  systems 
[3][5]  and  has  been  included  in  a  library  of  sparse  equation  solvers  classified  in  Figure 
1  and  directed  at  general  and  specialized  sparsity  structures.  (These  solvers  will  be 
presented  at  a  forthcoming  sparse  matrix  conference  in  Fall,  1962.)  A  symmetric 
matrix  version  of  this  software  is  intended  to  be  developed  with  joint  AFFDL  support;  it 
will  be  used  to  solve  optimization  problems  in  the  structural  aspects  of  wing  design. 
Also,  optimal  blocking  strategies  are  being  studied,  based  on  a  timing  model 
(developed  using  a  CRAY-1  simulator)  of  the  solution  code. 

2.  OTHER  PROGRESS 

SCHEDULING 

It  was  observed  during  the  development  of  high  performance  equation  solvers  for 
the  CRAY-1  that  optimality  of  the  implementation  could  not  be  guaranteed.  It 
appeared,  however,  that  the  CRAY-1  architecture  could  be  described  as  a  mathematical 
programming  problem.  Consistent  with  our  investment  in  other  program  development 
aids,  this  optimizer  is  being  developed  into  a  useful  package  and  is  apparently  of 
interest  to  others  with  desire  for  truly  optimal  codings.  Moreover,  a  related  confer¬ 
ence  presentation  has  been  made  [3],  a  journal  manuscript  is  being  prepared,  and  a 
Ph.D.  thesis  is  being  written  on  this  work. 


ELECTRONIC  CIRCUIT  ANALYSIS 


Because  our  multi-level  sparse  matrix  algorithms  are  the  basis  of  the  Berkeley 
effort  to  vectorize  their  popular  SPICE  electronic  circuit  analysis  program,  it  is  worth 
reporting  that  their  AFOSR-funded  project  is  producing  speedups  of  8:1  over  the  origi¬ 
nal  scalar  code  for  "small”  (268-transistor)  circuits.  Larger  speedups  are  expected 
with  larger  circuits  that  yield  longer  vectors.  A  preliminary  version  of  the  revised 
17000-statement  program  is  expected  to  be  released  this  summer. 

C.  COUPLING  ACTIVITIES 

1.  SEMINARS 

A  review  of  our  vector  processing  research  was  presented  at  the  AFFDL. 

2.  CONSULTING 

(a)  Visiting  scientist,  AFFDL,  to  vectorized  an  explicit  Navier-Stokes  codes  on 
the  CYBER  205  and  to  study  any  associated  I/O  problems  (-9/30/82). 

(b)  Visiting  scientist,  LANL,  on  vectorized  Monte  Carlo  (5/1/81-4/30/82). 

(c)  Industrial  consultant,  Mobil  Research  and  Development,  on  supercomputer 
procurement  evaluation  (5/1/81-1/15/82). 

(d)  Industrial  consultant,  Chevron  Oil  Field  Research  Co.,  on  organization  of 
vectorized  sparse  matrix  algorithms  (2/82). 

3.  OTHER 

(a)  A  one-week  short  course  at  the  University  was  presented  in  August,  1981, 
on  High  Speed  Computation. 

(b)  Evaluation  of  proposals  for  the  NASA  Numerical  Aerodynamic  Simulator 
(NAS)  was  initiated,  as  an  appointed  member  of  a  Technical  Review  Board 
(4/1/82) 
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ITEM  #20,  CONTINUED:  ^(b)  tp  be  stored  a  constant  stride  apart,  then  a 
simultaneous  sparse  solver  [7] 'may  be  invoked  and  a  vector  solution  obtained. 

As  sparse  systems  become  locally  coupled  -  as  occurs  in  finite  element  and 
finite  difference  problems  -  then  vectors  are  easily  defined  within  the 
coupled  subsystems.  It  is  worth  making  a  further  distinction  between; 

(a)  intra-nodal  or  intra-element  coupling,  where  the  dimension  of 
dense  submatrices  is  proportional  to  the  number  of  unknowns/node  or 
unknowns/ finite  element,  and 

v(b)  inter-nodal  or  inter-element,  where  the  coupling  between  grid 
nodes  or  finite  elements  determines  the  vector  length. 

Banded  and  profile  matrices  result  from  the  latter.  The  associated  vector' 
lengths  are  the  products  of  the  number  of  unknown/node  (element)  and  the 
number  of  coupled  nodes.  These  lengths  are  therefore  always  longer  than  in 
the  former  case,  so  that  common  bandsolvers  offer  the  highest  performance  of 
any  sparse  solvers 

•? 

In  previous  research,  algorithms  and  CRAY-1  software  have  been  developed  on 
this  grant  for  (Figure  1) 

(a)  general  sparse  matrices  [10], 

(b)  patterned  sparse  matrices,  in  conjunction  with  a  vectorized 
electronic  circuit  analysis  program  [7][li],  and 

(c)  blocked  matrices  arising  from  intra-nodal  coupling  [B]. 

In  addition,  Duff  and  Reid  [9]  have  converted  a  Fortran  frontal  ("intra- 
element")  solver  to  the  CRAY-1. 
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